Abstract

Based on recent results for multi-armed bandit problems, we propose an adaptive sampling
algorithm that approximates the optimal value of a finite horizon Markov decision process
(MDP) with infinite state space but finite action space and bounded rewards. The algorithm
adaptively chooses which action to sample as the sampling process proceeds, and it is proven
that the estimate produced by the algorithm is asymptotically unbiased and the worst possible
bias is bounded by a quantity that converges to zero at rate $O\left(\frac{H \ln N}{N}\right)$, where $H$ is the horizon
length and $N$ is the total number of samples that are used per state sampled in each stage. The
worst-case running-time complexity of the algorithm is $O((|A|N)^H)$, independent of the state
space size, where $|A|$ is the size of the action space. The algorithm can be used to create an
approximate receding horizon control to solve infinite horizon MDPs.

Keywords: (adaptive) sampling, Markov decision process, multi-armed bandit problem, receding
horizon control

1 Introduction

In this paper, we propose an “adaptive” sampling algorithm that approximates the optimal value
to break the well-known curse of dimensionality in solving finite horizon Markov decision processes
(MDPs). The algorithm is aimed at solving MDPs with a large (possibly infinite) state space
but with a finite action space and bounded rewards. The approximate value computed by the
algorithm not only converges to the true optimal value but also does so in an “efficient” way.
The algorithm adaptively chooses which action to sample as the sampling process proceeds, and
the estimate produced by the algorithm is asymptotically unbiased, with the worst possible bias
bounded by a quantity that converges to zero at rate $O\left(\sum_{i=0}^{H-1} \frac{\ln N_i}{N_i}\right)$, where $H$ is the length of
the horizon and $N_i$ is the total number of samples that are used per state sampled in stage $i$. The
logarithmic bound in the numerator is achievable uniformly over time. Given that the action space
size is $|A|$, the worst-case running time-complexity of the algorithm is $O\left((|A| \max_{i=0,\ldots,H-1} N_i)^H\right)$,
which is independent of the state space size but is dependent on the size of the action space due to
the requirement that each action be sampled at least once at each sampled state.

The  idea  behind  the  adaptive  sampling  algorithm  is  based  on  the  expected  regret  analysis  of 
the  multi-armed  bandit  problem  developed  by  Lai  and  Robbins  (1985).  In  particular,  we  exploit 
the recent finite-time analysis work by Auer, Cesa-Bianchi, and Fischer (2002) that elaborated on
Agrawal (1995). The goal of the multi-armed bandit problem is to play as often as possible the
machine that yields the highest (expected) reward. The regret quantifies the exploration/exploitation
dilemma  in  the  search  for  the  true  “optimal”  machine,  which  is  unknown  in  advance.  During  the 
search  process,  we  wish  to  explore  the  reward  distribution  of  different  machines  while  also  frequently 
playing  the  machine  that  is  empirically  best  thus  far.  The  regret  is  the  expected  loss  due  to  not 
always  playing  the  true  optimal  machine.  Lai  and  Robbins  (1985)  showed  that  for  an  optimal 
strategy  the  regret  grows  at  least  logarithmically  in  the  number  of  machine  plays,  and  recently 
Auer, Cesa-Bianchi, and Fischer (2002) showed that the logarithmic regret is also achievable
uniformly over time with a simple and efficient sampling algorithm for arbitrary reward distributions
with  bounded  support.  We  incorporate  their  results  into  a  sampling-based  process  for  finding  an 
optimal  action  in  a  state  for  a  single  stage  of  an  MDP  by  appropriately  converting  the  definition 
of  regret  into  the  difference  between  the  true  optimal  value  and  the  approximate  value  yielded  by 
the  sampling  process.  We  then  extend  the  one-stage  sampling  process  into  multiple  stages  in  a 
recursive  manner,  leading  to  a  multi-stage  (sampling-based)  approximation  algorithm  for  solving 
MDPs. 

This paper is organized as follows. In Section 2, we give the necessary background and an
intuitive description of the adaptive sampling algorithm, present a formal description of the
algorithm, and discuss how to create an (approximate) receding horizon control (Hernandez-Lerma and
Lasserre, 1990) via the sampling algorithm to solve MDPs in an “on-line” manner in the context
of “planning” for infinite horizon criteria. In Section 3, we provide the proofs for the convergence
and the convergence rate of the worst-case bias, and in Section 4, we compare the algorithm with
a nonadaptive sampling algorithm. In Section 5, we conclude this paper with some remarks.


2  Adaptive  Sampling  Algorithm 


2.1  Background 


Consider a finite horizon MDP $M = (X, A, P, R)$ with countable state space $X$, finite action space
$A$ with $|A| > 1$, nonnegative and bounded reward function $R$ such that $R : X \times A \to \mathbb{R}^+$, and
transition function $P$ that maps a state and action pair to a probability distribution over $X$. We
denote the probability of transitioning to state $y \in X$ when taking action $a$ in state $x \in X$ by
$P(x,a)(y)$. For simplicity, we assume that every action is admissible in every state.

Let $\Pi$ be the set of all possible nonstationary Markovian policies $\pi = \{\pi_t \mid \pi_t : X \to A,\ t \ge 0\}$.
Our goal is to estimate the optimal discounted total reward (thereby obtaining an (approximate)
optimal policy) for horizon length $H$, discount factor $\gamma$, and initial state $x_0$. Defining the optimal
reward-to-go value function for state $x$ in stage $i$ by

$$V_i^*(x) = \sup_{\pi \in \Pi} E\left[ \sum_{t=i}^{H-1} \gamma^{t-i} R(x_t, \pi_t(x_t)) \,\Big|\, x_i = x \right], \quad 0 < \gamma \le 1,\ i = 0, \ldots, H-1,$$

with $V_H^*(x) = 0$ for all $x \in X$ and $x_t$ a random variable denoting the state at time $t$ following
policy $\pi$, we wish to estimate $V_0^*(x_0)$. Throughout the paper, we assume that $\gamma$ is fixed. It is
well-known (see, e.g., Bertsekas 1995) that $V_i^*$ can be written recursively as follows: for all $x \in X$
and $i = 0, \ldots, H-1$,


$$V_i^*(x) = \max_{a \in A} Q_i^*(x,a), \quad \text{where}$$

$$Q_i^*(x,a) = R(x,a) + \gamma \sum_{y \in X} P(x,a)(y)\, V_{i+1}^*(y), \quad \text{with } V_H^*(x) = 0,\ x \in X.$$
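As a point of reference for the recursion above, the following is a minimal sketch of exact backward induction on a small finite MDP. The data layout (`P[x][a]` as a dictionary of next-state probabilities, `R[x][a]` as a reward table) and the toy numbers are assumptions made for illustration and do not come from the paper.

```python
def backward_induction(P, R, H, gamma):
    """Exact finite-horizon values for a small finite MDP.

    P[x][a] is a dict {next_state: probability}; R[x][a] is the reward.
    Returns V with V[i][x] = V_i^*(x) and V[H][x] = 0 (terminal condition).
    """
    states = range(len(R))
    actions = range(len(R[0]))
    V = [[0.0 for _ in states] for _ in range(H + 1)]
    for i in range(H - 1, -1, -1):          # stages H-1, ..., 0
        for x in states:
            # V_i^*(x) = max_a { R(x,a) + gamma * sum_y P(x,a)(y) V_{i+1}^*(y) }
            V[i][x] = max(
                R[x][a] + gamma * sum(p * V[i + 1][y] for y, p in P[x][a].items())
                for a in actions
            )
    return V

# Hypothetical two-state, two-action MDP.
P = [
    [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}],
    [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}],
]
R = [[0.1, 0.0], [0.0, 0.2]]
V = backward_induction(P, R, H=3, gamma=0.9)
print(V[0])
```

This enumerates every state at every stage, which is exactly the cost that a sampling-based estimate tries to avoid when $X$ is large or infinite.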

We remark that the work here can be extended to Borel state space with appropriate
measure-theoretic arguments, and that the assumption of a zero terminal reward function (made for
simplicity) can be relaxed to allow an arbitrary (bounded) terminal reward function.

Suppose we estimate $Q_i^*(x,a)$ by a sample mean $\hat{Q}_i^{N_i}(x,a)$ for each action $a \in A$, where

$$\hat{Q}_i^{N_i}(x,a) = R(x,a) + \gamma \frac{1}{N_{a,i}^x} \sum_{y \in S_a^x} \hat{V}_{i+1}^{N_{i+1}}(y), \qquad (1)$$

where $S_a^x$ is the multiset of (independently) sampled next states according to the distribution
$P(x,a)$, and $|S_a^x| = N_{a,i}^x \ge 1$ for all $x \in X$ and such that $\sum_{a \in A} N_{a,i}^x = N_i$ for a fixed $N_i \ge |A|$ for all
$x \in X$, and $\hat{V}_{i+1}^{N_{i+1}}(y)$ is an estimate of the unknown $V_{i+1}^*(y)$. Note that the number of next state
samples depends on the state $x$, action $a$, and stage $i$. Suppose also that we estimate the optimal
value of $V_i^*(x)$ by

$$\hat{V}_i^{N_i}(x) = \sum_{a \in A} \frac{N_{a,i}^x}{N_i} \hat{Q}_i^{N_i}(x,a).$$

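This leads to the following recursion:

$$\hat{V}_i^{N_i}(x) = \sum_{a \in A} \frac{N_{a,i}^x}{N_i} \left( R(x,a) + \gamma \frac{1}{N_{a,i}^x} \sum_{y \in S_a^x} \hat{V}_{i+1}^{N_{i+1}}(y) \right), \quad i = 0, \ldots, H-1,$$

with $\hat{V}_H^{N_H}(x) = 0$ for all $x \in X$ and any $N_H > 0$.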

In the above definition, the total number of sampled (next) states is $O(N^H)$ with $N =
\max_{i=0,\ldots,H-1} N_i$, which is independent of the state space size. One approach is to select
“optimal” values of $N_{a,i}^{x'}$ for $i = 0, \ldots, H-1$, $a \in A$, and $x' \in X$, such that the expected error between
the values of $\hat{V}_0^{N_0}(x)$ and $V_0^*(x)$ is minimized, but this problem would be difficult to solve. So
instead we seek the values of $N_{a,i}^{x'}$ for $i = 0, \ldots, H-1$, $a \in A$, and $x' \in X$ such that the expected
difference is bounded as a function of $N_{a,i}^{x'}$ and $N_i$, $i = 0, \ldots, H-1$, and that the bound (from above
and from below) goes to zero as $N_i$, $i = 0, \ldots, H-1$, go to infinity. We propose an “adaptive”
allocation rule (sampling algorithm) that adaptively chooses which action to sample, updating the
value of $N_{a,i}^{x'}$ as the sampling process proceeds, and achieves convergence such that as $N_i \to \infty$ for
all $i = 0, \ldots, H-1$, $E[\hat{V}_0^{N_0}(x)] \to V_0^*(x)$, and is efficient in the sense that the worst possible bias
is bounded by a quantity that converges to zero at rate $O\left(\sum_{i} \frac{\ln N_i}{N_i}\right)$, where the logarithmic bound in
the numerator is achievable uniformly over time.

As mentioned before, the main idea behind the adaptive allocation rule is based on a simple
interpretation of the regret analysis of the multi-armed bandit problem, a well-known model that
captures the exploitation/exploration trade-off. An $M$-armed bandit problem is defined by random
variables $X_{i,n}$ for $1 \le i \le M$ and $n \ge 1$, where successive plays of machine $i$ yield “rewards”
$X_{i,1}, X_{i,2}, \ldots$, which are independent and identically distributed according to an unknown but fixed
distribution $\delta_i$ with unknown expectation $\mu_i$. The rewards across machines are also independently
generated. Let $T_i(n)$ be the number of times machine $i$ has been played by an algorithm during
the first $n$ plays. Define the expected regret $\rho(n)$ of an algorithm after $n$ plays by

$$\rho(n) = \mu^* n - \sum_{i=1}^{M} \mu_i E[T_i(n)], \quad \text{where } \mu^* := \max_i \mu_i.$$


Lai and Robbins (1985) characterized an “optimal” algorithm such that the best machine, which
is associated with $\mu^*$, is played exponentially more often than any other machine, at least
asymptotically. That is, they showed that playing machines according to an (asymptotically) optimal
algorithm leads to $\rho(n) = O(\ln n)$ as $n \to \infty$ under mild assumptions on the reward distributions.
Unfortunately, obtaining an optimal algorithm (proposed by Lai and Robbins) can sometimes be
very difficult, so Agrawal (1995) derived a set of simple algorithms that achieve the asymptotic
logarithmic regret behavior, using a form of upper confidence bounds. During the plays, we are tempted
to take the machine with the maximum current sample mean (exploitation). But the sample mean
$\hat{\mu}_i(\bar{n})$ for machine $i$ is just an estimate that contains uncertainty, where $\bar{n}$ is the number of overall
plays so far. To account for this, we add a function $\sigma_i(\bar{n})$ such that $\hat{\mu}_i(\bar{n}) - \sigma_i(\bar{n}) \le \mu_i \le \hat{\mu}_i(\bar{n}) + \sigma_i(\bar{n})$
with high probability, where $\hat{\mu}_i(\bar{n}) + \sigma_i(\bar{n})$ is the upper confidence bound (see Agrawal, 1995 for
a substantial discussion). Then the width of the confidence bound gives us guidance for
exploration. Indeed, the use of the upper confidence bound leads us to trade off between exploitation
and exploration, giving a criterion for which of the two to select. Agrawal’s algorithm is to choose
the machine with the highest upper confidence bound at
each play over time. For bounded rewards, Auer, Cesa-Bianchi, and Fischer (2002) propose simple
upper confidence-bound based algorithms that achieve the logarithmic regret uniformly over time,
rather than only asymptotically, and our sampling algorithm primarily builds on their results.
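To make the index policy concrete, here is a minimal sketch of a UCB1-style rule for bounded rewards: play each machine once, then always play the machine whose sample mean plus confidence width $\sqrt{2 \ln \bar{n} / T_i}$ is largest. The Bernoulli arms, the seed, and the function names are illustrative assumptions, not part of the paper.

```python
import math
import random

def ucb1(arms, n_plays, rng):
    """Run a UCB1-style index policy on `arms` for `n_plays` rounds.

    Each arm is a callable rng -> reward in [0, 1] (an assumption of this sketch).
    Returns (counts, means): plays per arm and sample-mean rewards.
    """
    M = len(arms)
    counts = [0] * M
    means = [0.0] * M
    for played in range(n_plays):
        if played < M:
            i = played  # initialization: play each machine once
        else:
            # index = sample mean + confidence width sqrt(2 ln(plays so far) / T_i)
            i = max(
                range(M),
                key=lambda j: means[j] + math.sqrt(2.0 * math.log(played) / counts[j]),
            )
        reward = arms[i](rng)
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental sample mean
    return counts, means

def bernoulli(p):
    """Hypothetical reward distribution: 1 with probability p, else 0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0

counts, means = ucb1([bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)], 2000, random.Random(0))
print(counts)  # the arm with mean 0.8 should receive most of the plays
```

The confidence width shrinks for frequently played machines and grows slowly for neglected ones, so every machine keeps being explored while the empirically best machine is played most often.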

For an intuitive description of the allocation rule, consider first only the one-stage
approximation. That is, we assume for now that we know $V_1^*(x)$ for all $x \in X$. To estimate $V_0^*(x)$, obviously
we need to estimate $Q_0^*(x,a^*)$, where $a^* \in \arg\max_{a \in A} Q_0^*(x,a)$. The search for $a^*$ corresponds to
the search for the best machine in the multi-armed bandit problem. We start by sampling each
possible action once at $x$, which leads to the next state according to $P(x,a)$ and reward $R(x,a)$. We
then iterate as follows (see Loop in Figure 1). The next action to sample is the one that achieves
the maximum among the current estimates of $Q_0^*(x,a)$ plus its current upper confidence bound (see
Equation (3)), where the estimate $\hat{Q}_0^{N_0}(x,a)$ is given by the immediate reward plus the sample mean
of $V_1^*$-values at the next states that have been sampled so far (see Equation (4)).

Among the $N_0$ samples for state $x$, $N_{a,0}^x$ denotes the number of samples at action $a$. If the
sampling is done appropriately, we might expect that $N_{a,0}^x / N_0$ provides a good estimate of the
probability that action $a$ is optimal in state $x$. In the limit as $N_0 \to \infty$, we would expect $N_{a^*,0}^x / N_0 \to 1$.
Therefore, we use a weighted (by $N_{a,0}^x / N_0$) sum of the currently estimated value of $Q_0^*(x,a)$ over $A$ to
approximate $V_0^*(x)$ (see Equation (5)). Ensuring that the weighted sum concentrates on $a^*$ as the
sampling proceeds will ensure that in the limit the estimate of $V_0^*(x)$ converges to $V_0^*(x)$.

2.2  Algorithm  description 

We now provide a high-level description of the adaptive multi-stage sampling (AMS) algorithm to
estimate $V_0^*(x)$ for a given state $x$ in Figure 1. The inputs to AMS are a state $x \in X$, $N_i \ge |A|$, and
stage $i$, and the output of AMS is $\hat{V}_i^{N_i}(x)$, the estimate of $V_i^*(x)$. Whenever we encounter $\hat{V}_k^{N_k}(y)$
for a state $y \in X$ and stage $k$ in the Initialization and Loop portions of the AMS algorithm,
we need to call AMS recursively (at Equation (2) and (5)). The initial call to AMS is done with
$i = 0$, the initial state $x$, and $N_0$, and every sampling is done independently of the previously
done samplings. To help understand how the recursive calls are made sequentially, in Figure 2,
we graphically illustrate the sequence of calls with two actions and $H = 3$ for the Initialization
portion.

The AMS algorithm is a recursive extension of the UCB1 algorithm given in Auer, Cesa-Bianchi,
and Fischer (2002) in the context of the MDP framework. It is based on the index-based policy of
Agrawal (1995), where the index for an action is given by the sum of the current estimate of the
true Q-value for the action plus a term that relates to the size of the upper confidence bound.

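Adaptive Multi-stage Sampling (AMS)

• Input: a state $x \in X$, $N_i \ge |A|$, and stage $i$. Output: $\hat{V}_i^{N_i}(x)$.

• Initialization: Sample each action $a \in A$ sequentially once at state $x$ and set

$$\hat{Q}_i^{N_i}(x,a) = \begin{cases} 0 & \text{if } i = H \text{ (and go to Exit)} \\ R(x,a) + \gamma \hat{V}_{i+1}^{N_{i+1}}(y) & \text{if } i \ne H, \end{cases} \qquad (2)$$

where $y$ is the sampled next state with respect to $P(x,a)$, and set $\bar{n} = |A|$.

• Loop: Sample sequentially each action $a^*$ that achieves

$$\max_{a \in A} \left( \hat{Q}_i^{N_i}(x,a) + \sqrt{\frac{2 \ln \bar{n}}{N_{a,i}^x}} \right), \qquad (3)$$

where $N_{a,i}^x$ is the number of times action $a$ has been sampled so far, $\bar{n}$ is the
overall number of samples done so far for this stage, and $\hat{Q}_i^{N_i}$ is defined by

$$\hat{Q}_i^{N_i}(x,a) = R(x,a) + \gamma \frac{1}{N_{a,i}^x} \sum_{y \in S_a^x} \hat{V}_{i+1}^{N_{i+1}}(y), \qquad (4)$$

where $S_a^x$ is the set of sampled next states so far with $|S_a^x| = N_{a,i}^x$ with respect to
the distribution $P(x,a)$.

- Update $N_{a^*,i}^x \leftarrow N_{a^*,i}^x + 1$ and $S_{a^*}^x \leftarrow S_{a^*}^x \cup \{y'\}$, where $y'$ is the newly sampled
next state by $a^*$.

- Update $\hat{Q}_i^{N_i}(x,a^*)$ with the $\hat{V}_{i+1}^{N_{i+1}}(y')$ value.

- $\bar{n} \leftarrow \bar{n} + 1$. If $\bar{n} = N_i$, then exit Loop.

• Exit: Set $\hat{V}_i^{N_i}(x)$ such that

$$\hat{V}_i^{N_i}(x) = \begin{cases} \sum_{a \in A} \frac{N_{a,i}^x}{N_i} \hat{Q}_i^{N_i}(x,a) & \text{if } i = 0, \ldots, H-1 \\ 0 & \text{if } i = H, \end{cases} \qquad (5)$$

and return $\hat{V}_i^{N_i}(x)$.

Figure 1: Adaptive multi-stage sampling algorithm (AMS) description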
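The steps of Figure 1 can be sketched as a short recursive function. This is an illustrative reading of the figure rather than the authors' implementation: the simulator interface (`reward`, `sample_next`), the list `N` of per-stage budgets, and the toy MDP are all assumptions of the sketch.

```python
import math
import random

def ams(x, i, N, H, gamma, actions, reward, sample_next, rng):
    """Return the AMS estimate of V_i^*(x); N[i] >= len(actions) is the stage-i budget."""
    if i == H:
        return 0.0                          # terminal condition
    count = {a: 0 for a in actions}         # N_{a,i}^x: samples taken at each action
    v_sum = {a: 0.0 for a in actions}       # running sum of next-stage estimates

    def sample(a):
        y = sample_next(x, a, rng)          # draw y from P(x, a)
        v_sum[a] += ams(y, i + 1, N, H, gamma, actions, reward, sample_next, rng)
        count[a] += 1

    def q_hat(a):                           # reward plus discounted sample mean
        return reward(x, a) + gamma * v_sum[a] / count[a]

    for a in actions:                       # Initialization: each action once
        sample(a)
    n_bar = len(actions)
    while n_bar < N[i]:                     # Loop: sample the action with the largest index
        a_star = max(
            actions,
            key=lambda a: q_hat(a) + math.sqrt(2.0 * math.log(n_bar) / count[a]),
        )
        sample(a_star)
        n_bar += 1
    # Exit: average the Q estimates, weighted by how often each action was sampled
    return sum(count[a] / N[i] * q_hat(a) for a in actions)

# Hypothetical toy MDP on the integers with rewards bounded by 0.3 <= 1/H.
def toy_reward(x, a):
    return 0.3 if (x + a) % 2 == 0 else 0.1

def toy_next(x, a, rng):
    return x + 1 if rng.random() < 0.7 else x

estimate = ams(0, 0, [8, 8, 8], 3, 0.9, [0, 1], toy_reward, toy_next, random.Random(0))
print(estimate)
```

Note that the state space never appears as a data structure: only sampled states are visited, which is why the cost depends on the budgets and the number of actions but not on $|X|$.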


Figure 2: Graphical illustration of the sequence of the recursive calls made in Initialization of the
AMS algorithm. Each circle corresponds to a state and each arrow with noted action signifies a
sampling (and a recursive call). The bold-face number near each arrow is the sequence number for
the recursive calls made. For simplicity, the entire Loop process is signified by one call number.

The running time-complexity of the AMS algorithm is $O((|A|N)^H)$ with $N = \max_i N_i$. To see
this, let $M_i$ be the number of recursive calls made to compute $\hat{V}_i^{N_i}$ in the worst case. At stage
$i$, AMS makes $|A| M_{i+1}$ recursive calls in Initialization and $|A| N_i M_{i+1}$ calls in Loop in the worst
case. Therefore, $M_i = (|A| + |A| N_i) M_{i+1}$ so that $M_0 = O((|A| + |A|N)^H) = O((|A|N)^H)$. In
contrast, backward induction has $O(H|A||X|^2)$ running time-complexity (see, e.g., Blondel 2000).
Therefore, the main benefit of AMS is independence from the state space size, but this comes at
the expense of exponential (versus linear, for backward induction) dependence on both the action
space and the horizon length.
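The recurrence for the worst-case call count is easy to unroll numerically. The sketch below assumes $M_H = 1$ as the base case, which the text does not state explicitly, and uses hypothetical problem sizes for the comparison with backward induction.

```python
def worst_case_calls(num_actions, N):
    """Unroll M_i = (|A| + |A| * N_i) * M_{i+1}, assuming M_H = 1, and return M_0."""
    M = 1
    for N_i in reversed(N):
        M *= num_actions + num_actions * N_i
    return M

def backward_induction_ops(H, num_actions, num_states):
    """Operation count H * |A| * |X|^2 quoted for backward induction."""
    return H * num_actions * num_states ** 2

print(worst_case_calls(2, [4, 4, 4]))         # (2 + 2*4)^3 = 1000
print(backward_induction_ops(3, 2, 10 ** 6))  # grows with the square of |X|
```

The first quantity is unaffected by the number of states, while the second is unaffected by the sampling budget, which is the trade-off described above.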

2.3  Creating  an  on-line  stochastic  policy 

Once armed with an algorithm that estimates the optimal value for finite horizon problems, we can
create a nonstationary stochastic policy in an on-line manner in the context of “planning” (see,
e.g., Kearns, Mansour, and Ng 2001). Suppose at time $t \ge 0$, we are at state $x \in X$. We evaluate
each action’s utility as follows:

$$R(x,a) + \gamma \frac{1}{|S_a^x|} \sum_{y \in S_a^x} \hat{V}_{t+1}^{N_{t+1}}(y), \quad a \in A, \qquad (6)$$

where we apply the AMS algorithm at the sampled next states for the stage $t + 1$. We simply take
the action that achieves the maximum utility. We remark that the use of common random numbers
(see, e.g., Law and Kelton 2000) across actions in the utility measures given by Equation (6) should
reduce the variance in the spirit of “differential training” in the rollout algorithm (Bertsekas 1997).

If we replace the horizon $H - 1$ by $t + H$ in the definition of $\hat{V}_{t+1}^{N_{t+1}}$ in Equation (6),
the resulting stochastic policy yields an (approximate) receding $H$-horizon control (Hernandez-Lerma
and Lasserre 1990) for the infinite horizon problem.
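A minimal sketch of the on-line action choice described in this subsection follows. The callable `value_estimate` stands in for whatever next-stage estimator is used (for example, an AMS call at stage $t+1$); it, the simulator interface, and the parameter `n_samples` are assumptions of this sketch.

```python
import random

def select_action(x, actions, reward, sample_next, value_estimate, gamma, n_samples, rng):
    """Pick the action with the largest sampled utility: reward plus the
    discounted mean of next-stage value estimates over n_samples next states."""
    def utility(a):
        next_states = [sample_next(x, a, rng) for _ in range(n_samples)]
        mean_value = sum(value_estimate(y) for y in next_states) / len(next_states)
        return reward(x, a) + gamma * mean_value
    return max(actions, key=utility)

# Hypothetical usage: a placeholder estimator that returns zero everywhere,
# so the choice reduces to the immediate reward.
best = select_action(
    x=0,
    actions=[0, 1],
    reward=lambda x, a: 0.1 * a,
    sample_next=lambda x, a, rng: x + a,
    value_estimate=lambda y: 0.0,
    gamma=0.9,
    n_samples=5,
    rng=random.Random(0),
)
print(best)
```

Re-running this selection at every time step with a fixed look-ahead length gives the receding horizon behavior: the planner always looks the same number of stages ahead of the current time.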


An  Adaptive  Sampling  Algorithm  for  Solving  MDPs 


3  Convergence  Analysis 


In this section, we prove the convergence of the AMS algorithm and show that the worst possible
bias converges to zero at rate $O\left(\sum_{i=0}^{H-1} \frac{\ln N_i}{N_i}\right)$.


Theorem 3.1 Let $R_{\max} = \sup_{x,a} R(x,a)$ and assume that $R_{\max} \le \frac{1}{H}$. Suppose AMS is run with
the input $N_i$ for stage $i = 0, \ldots, H-1$ and an arbitrary initial state $x \in X$. Then

$$\lim_{N_0 \to \infty} \lim_{N_1 \to \infty} \cdots \lim_{N_{H-1} \to \infty} E[\hat{V}_0^{N_0}(x)] = V_0^*(x).$$


Proof  of  Theorem  3.1: 

We start with a convergence result for the one-stage approximation. Consider the following
one-stage sampling algorithm (OSA) in Figure 3 with a stochastic value function $U$ defined over $X$.
$U(x)$ for $x \in X$ is a nonnegative random variable with unknown distribution and bounded above
for all $x \in X$. We will denote $\hat{U}(x)$ as a (random) sample from the unknown distribution associated
with $U(x)$. As before, every sampling is done independently and we are assuming that there is a
black box that returns $\hat{U}(x)$ once $x$ is given to the black box. Let

$$U_{\max} = \sup_{x,a} \left( R(x,a) + \gamma \sum_{y \in X} P(x,a)(y) E[U(y)] \right),$$

and assume for the moment that $U_{\max} \le 1$.

We  state  a  key  lemma  that  will  be  used  to  prove  the  convergence  of  the  AMS  algorithm. 


Lemma 3.1 Given a stochastic value function $U$ defined over $X$ with $U_{\max} \le 1$, suppose we run
OSA with the input $n$. Define for all $x \in X$,

$$V(x) = \max_{a \in A} \left( R(x,a) + \gamma \sum_{y \in X} P(x,a)(y) E[U(y)] \right).$$

Then, for all $x \in X$,

$$E[\hat{V}^n(x)] \to V(x) \quad \text{as } n \to \infty.$$


Proof  of  Lemma  3.1: 

Fix a state $x \in X$ and index each action in $A$ by numbers from 1 to $|A|$. Consider an $|A|$-armed
bandit problem where each $a$ is a gambling machine. Successive plays of machine $a$ yield “bandit
rewards” which are independent and identically distributed according to an unknown distribution
$\delta_a$ with unknown expectation

$$Q(x,a) = R(x,a) + \gamma \sum_{y \in X} P(x,a)(y) E[U(y)],$$

and are independent across machines or actions.

One-stage Sampling Algorithm (OSA)

• Input: a state $x \in X$ and $n \ge |A|$.

• Initialization: Sample each action $a \in A$ once at state $x$ and set

$$\hat{Q}(x,a) = R(x,a) + \gamma \hat{U}(y),$$

where $y$ is the sampled next state with respect to $P(x,a)$, and set $\bar{n} = |A|$.

• Loop: Sample each action $a^*$ that achieves

$$\max_{a \in A} \left( \hat{Q}(x,a) + \sqrt{\frac{2 \ln \bar{n}}{T_a^x(\bar{n})}} \right),$$

where $T_a^x(\bar{n})$ is the number of times action $a$ has been sampled so far
at state $x$, $\bar{n}$ is the overall number of samples done so far, and $\hat{Q}$ is
defined by

$$\hat{Q}(x,a) = R(x,a) + \gamma \frac{1}{T_a^x(\bar{n})} \sum_{y \in \Lambda_a^x} \hat{U}(y),$$

where $\Lambda_a^x$ is the set of sampled next states so far with $|\Lambda_a^x| = T_a^x(\bar{n})$ with
respect to the distribution $P(x,a)$.

- Update $T_{a^*}^x(\bar{n}) \leftarrow T_{a^*}^x(\bar{n}) + 1$ and $\Lambda_{a^*}^x \leftarrow \Lambda_{a^*}^x \cup \{y'\}$, where $y'$ is the newly
sampled next state by $a^*$.

- Update $\hat{Q}(x,a^*)$ with $\hat{U}(y')$.

- $\bar{n} \leftarrow \bar{n} + 1$. If $\bar{n} = n$, then exit Loop.

• Exit: Set $\hat{V}^n$ such that

$$\hat{V}^n(x) = \sum_{a \in A} \frac{T_a^x(n)}{n} \hat{Q}(x,a). \qquad (7)$$

Figure 3: One-stage sampling algorithm (OSA) description


The term $T_a^x(n)$ signifies the number of times machine $a$ has been played (or action $a$ has been
sampled) by OSA during the $n$ plays. Define the expected regret $\rho(n)$ of OSA after $n$ plays by

$$\rho(n) = V(x)\, n - \sum_{a=1}^{|A|} Q(x,a) E[T_a^x(n)], \quad \text{where } V(x) = \max_{a \in A} Q(x,a).$$

Applying Theorem 1 from Auer, Cesa-Bianchi, and Fischer (2002) gives the following bound on
$\rho(n)$:

Theorem 3.2 For all $|A| > 1$, if OSA is run on $|A|$ machines having arbitrary bandit reward
distributions $\delta_1, \ldots, \delta_{|A|}$ with $U_{\max} \le 1$, then

$$\rho(n) \le \sum_{a:\, Q(x,a) < V(x)} \left[ \frac{8 \ln n}{V(x) - Q(x,a)} + \left(1 + \frac{\pi^2}{3}\right) \left(V(x) - Q(x,a)\right) \right],$$
where $Q(x,a)$ is the expected value of bandit rewards with respect to $\delta_a$.

See Auer, Cesa-Bianchi, and Fischer (2002) for a proof of the above theorem. Observe that
$\max_a (V(x) - Q(x,a)) \le U_{\max}$. Let $\phi(x) = \{a \mid Q(x,a) < V(x), a \in A\}$, i.e., the set of nonoptimal
actions for $x$. Define $\alpha(x)$ for $\phi(x) \ne \emptyset$ such that

$$\alpha(x) = \min_{a \in \phi(x)} \left(V(x) - Q(x,a)\right) \qquad (8)$$

and note that $0 < \alpha(x) \le U_{\max}$. Define

$$\tilde{V}(x) = \sum_{a \in A} \frac{T_a^x(n)}{n} Q(x,a).$$

Applying Theorem 3.2, we have

$$0 \le V(x) - E[\tilde{V}(x)] = \frac{\rho(n)}{n} \le \frac{8(|A|-1) \ln n}{n\, \alpha(x)} + \left(1 + \frac{\pi^2}{3}\right) \frac{(|A|-1)\, U_{\max}}{n}. \qquad (9)$$

Note also that $\rho(n) = 0$ if $\phi(x) = \emptyset$.


From the definition of $\hat{V}^n(x)$ given by Equation (7), it follows that

$$V(x) - E[\hat{V}^n(x)] = V(x) - E[\tilde{V}(x) - \tilde{V}(x) + \hat{V}^n(x)] = V(x) - E[\tilde{V}(x)] + E\left[ \sum_{a \in A} \frac{T_a^x(n)}{n} \left( Q(x,a) - \hat{Q}(x,a) \right) \right].$$

Letting $n \to \infty$, the first term $V(x) - E[\tilde{V}(x)]$ is bounded from below by zero and converges to
zero at rate $O\left(\frac{\ln n}{n}\right)$ by Equation (9). We show now that the second expectation term is zero.

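First observe that $T_a^x(n)$ for every finite $n$ is a stopping time (see, e.g., Ross 1995, p. 104) with
$E[T_a^x(n)] \le n < \infty$. Let $\mu_a = \sum_{y \in X} P(x,a)(y) E[U(y)]$. Then

$$E\left[ \sum_{a \in A} \frac{T_a^x(n)}{n} \left( R(x,a) + \gamma \mu_a - R(x,a) - \gamma \frac{1}{T_a^x(n)} \sum_{y \in \Lambda_a^x} \hat{U}(y) \right) \right]$$

$$= \frac{\gamma}{n} \left( E\left[ \sum_{a \in A} T_a^x(n)\, \mu_a \right] - E\left[ \sum_{a \in A} \sum_{y \in \Lambda_a^x} \hat{U}(y) \right] \right)$$

$$= \frac{\gamma}{n} \left( \sum_{a \in A} E[T_a^x(n)]\, \mu_a - \sum_{a \in A} E\left[ \sum_{y \in \Lambda_a^x} \hat{U}(y) \right] \right)$$

$$= 0 \quad \text{by applying Wald’s equation.}$$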


Since

$$V(x) - E[\hat{V}^n(x)] = V(x) - E[\tilde{V}(x)],$$

the convergence follows directly from Equation (9).
Therefore, because $x$ was chosen arbitrarily, we have that for all $x \in X$,

$$E[\hat{V}^n(x)] \to V(x) \quad \text{as } n \to \infty,$$

which concludes the proof of Lemma 3.1. □


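We now return to the AMS algorithm. From the definition of $\hat{V}_{H-1}^{N_{H-1}}$,

$$\hat{V}_{H-1}^{N_{H-1}}(x) = \sum_{a \in A} \frac{N_{a,H-1}^x}{N_{H-1}} \left( R(x,a) + \gamma \frac{1}{N_{a,H-1}^x} \sum_{y \in S_a^x} \hat{V}_H^{N_H}(y) \right) \le \sum_{a \in A} \frac{N_{a,H-1}^x}{N_{H-1}} \left( R_{\max} + \gamma \cdot 0 \right) = R_{\max}, \quad x \in X.$$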

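Similarly for $\hat{V}_{H-2}^{N_{H-2}}$, we have that

$$\hat{V}_{H-2}^{N_{H-2}}(x) = \sum_{a \in A} \frac{N_{a,H-2}^x}{N_{H-2}} \left( R(x,a) + \gamma \frac{1}{N_{a,H-2}^x} \sum_{y \in S_a^x} \hat{V}_{H-1}^{N_{H-1}}(y) \right) \le \sum_{a \in A} \frac{N_{a,H-2}^x}{N_{H-2}} \left( R_{\max} + \gamma R_{\max} \right) = R_{\max}(1 + \gamma), \quad x \in X.$$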


Continuing this backwards, we have for all $x \in X$ and $i = 0, \ldots, H-1$,

$$\hat{V}_i^{N_i}(x) \le R_{\max} \sum_{j=0}^{H-i-1} \gamma^j \le R_{\max}(H - i) \le 1,$$

where the last inequality comes from the assumption that $R_{\max} H \le 1$.

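Therefore, from Lemma 3.1 with $U_{\max} = R_{\max}(H - i) \le 1$, we have for $i = 0, \ldots, H-1$, and for
arbitrary $x \in X$,

$$E[\hat{V}_i^{N_i}(x)] \to \max_{a \in A} \left( R(x,a) + \gamma \sum_{y \in X} P(x,a)(y) E[\hat{V}_{i+1}^{N_{i+1}}(y)] \right) \quad \text{as } N_i \to \infty.$$

But for arbitrary $x \in X$, because $\hat{V}_H^{N_H}(x) = V_H^*(x) = 0$, $x \in X$,

$$E[\hat{V}_{H-1}^{N_{H-1}}(x)] \to V_{H-1}^*(x) \quad \text{as } N_{H-1} \to \infty,$$

which in turn leads to $E[\hat{V}_{H-2}^{N_{H-2}}(x)] \to V_{H-2}^*(x)$ as $N_{H-2} \to \infty$ for arbitrary $x \in X$, and by an
inductive argument, we have that

$$\lim_{N_0 \to \infty} \lim_{N_1 \to \infty} \cdots \lim_{N_{H-1} \to \infty} E[\hat{V}_0^{N_0}(x)] = V_0^*(x) \quad \text{for all } x \in X,$$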

which concludes the proof of Theorem 3.1. □

We now argue that the worst possible bias by AMS is bounded by a quantity that converges to
zero at rate $O\left(\sum_{i=0}^{H-1} \frac{\ln N_i}{N_i}\right)$. Let $B(X)$ be the space of real-valued bounded measurable functions
on $X$ endowed with the supremum norm $\|\Phi\| = \sup_x |\Phi(x)|$ for $\Phi \in B(X)$. We define an operator
$T : B(X) \to B(X)$ as

$$T(\Phi)(x) = \max_{a \in A} \left\{ R(x,a) + \gamma \sum_{y \in X} P(x,a)(y) \Phi(y) \right\}, \quad \Phi \in B(X),\ x \in X. \qquad (11)$$


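Define $\Psi_i \in B(X)$ such that $\Psi_i(x) = E[\hat{V}_i^{N_i}(x)]$ for all $x \in X$ and $i = 0, \ldots, H-1$ and $\Psi_H(x) =
V_H^*(x) = 0$, $x \in X$. In the proof of Lemma 3.1 (see Equation (10)), we showed that for $i =
0, \ldots, H-1$,

$$T(\Psi_{i+1})(x) - \Psi_i(x) \le O\left( \frac{\ln N_i}{N_i} \right), \quad x \in X.$$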


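Therefore, we have

$$T(\Psi_1)(x) - \Psi_0(x) \le O\left( \frac{\ln N_0}{N_0} \right), \quad x \in X, \qquad (12)$$

and

$$\Psi_1(x) \ge T(\Psi_2)(x) - O\left( \frac{\ln N_1}{N_1} \right), \quad x \in X. \qquad (13)$$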


Applying the $T$-operator to both sides of Equation (13) and using the monotonicity property of $T$
(see, e.g., Bertsekas 1995), we have

$$T(\Psi_1)(x) \ge T^2(\Psi_2)(x) - O\left( \frac{\ln N_1}{N_1} \right), \quad x \in X. \qquad (14)$$


Therefore, combining Equation (12) and (14) yields

$$T^2(\Psi_2)(x) - \Psi_0(x) \le O\left( \frac{\ln N_0}{N_0} + \frac{\ln N_1}{N_1} \right), \quad x \in X.$$


Repeating this argument yields

$$T^H(\Psi_H)(x) - \Psi_0(x) \le O\left( \sum_{i=0}^{H-1} \frac{\ln N_i}{N_i} \right), \quad x \in X. \qquad (15)$$


Observe that $T^H(\Psi_H)(x) = V_0^*(x)$, $x \in X$. Rewriting Equation (15), we finally have

$$V_0^*(x) - E[\hat{V}_0^{N_0}(x)] \le O\left( \sum_{i=0}^{H-1} \frac{\ln N_i}{N_i} \right), \quad x \in X,$$

and we know that $V_0^*(x) - E[\hat{V}_0^{N_0}(x)] \ge 0$, $x \in X$. Therefore, the worst possible bias
is bounded by a quantity that converges to zero at rate $O\left(\sum_{i=0}^{H-1} \frac{\ln N_i}{N_i}\right)$.
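To get a feel for how the rate term behaves, the sketch below simply evaluates $\sum_i \ln N_i / N_i$ for a few budgets. The hidden constants of the $O(\cdot)$ bound are omitted, so the numbers only indicate how the term scales; the function name is a hypothetical helper.

```python
import math

def bias_rate(N):
    """Evaluate sum_i ln(N_i) / N_i over the per-stage budgets N (constants omitted)."""
    return sum(math.log(n) / n for n in N)

H = 3
for budget in (10, 100, 1000):
    # With equal budgets N_i = N the sum reduces to H * ln(N) / N.
    print(budget, bias_rate([budget] * H))
```

With equal budgets the term is $H \ln N / N$, so it grows only linearly in the horizon while decaying almost like $1/N$ in the per-stage budget.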

We remark that we can relax the assumption $R_{\max} \le \frac{1}{H}$ by a normalization of the given reward

function.  The  upper  bound  in  Theorem  3.2  for  p{n)  needs  to  be  modified  with  a  different  bounded 
2 

constant from 1 which can be achieved by the Hoeffding inequality with support in $[0, R_{\max} H]$
rather than in $[0, 1]$. Therefore, the assumption of the support in $[0, 1]$ is not crucial (Cesa-Bianchi
and Fischer 1998).

4 Comparison with a Nonadaptive Algorithm

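Consider the following recursive definition: given fixed $N > 0$,

$$\bar{V}_i^N(x) = \max_{a \in A} \left( R(x,a) + \gamma \frac{1}{N} \sum_{y \in U_a^x} \bar{V}_{i+1}^N(y) \right), \quad i = 0, \ldots, H-1,\ x \in X,$$

with $\bar{V}_H^N(x) = 0$ for all $x \in X$, where $U_a^x$ is the multiset of sampled states with respect to $P(x,a)$
and $|U_a^x| = N$ for all $a \in A$ and $x \in X$.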

The above recursive definition immediately suggests the following nonadaptive multi-stage
sampling (NMS) algorithm. NMS creates a random sample-path tree having depth $H$ and
branching factor $N|A|$ in a forward manner, where $N$ is the fixed number of next states to be
sampled from each sampled state in the sample-path tree for each action, and then, in a backward
manner, the estimate of the $Q_i^*$-value or $V_i^*$-value is computed recursively (see Kearns,
Mansour, and Ng 2001 for a detailed description and a performance analysis of NMS for the infinite horizon
discounted criterion). Note that the running time-complexity of NMS is $O((|A|N)^H)$, which is
similar to that of AMS in the worst case, and NMS is asymptotically unbiased in the sense that
as $N \to \infty$, $E[\bar{V}_0^N(x)] \to V_0^*(x)$ simply by the law of large numbers. This motivates us to compare
the convergence rates of AMS and NMS.
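For comparison with the adaptive rule, the nonadaptive recursion can be sketched in a few lines: every action at every sampled state receives the same fixed number of samples, and the estimate takes a maximum over actions. The simulator interface and the toy inputs below are assumptions of this sketch.

```python
import math
import random

def nms(x, i, N, H, gamma, actions, reward, sample_next, rng):
    """Nonadaptive estimate of V_i^*(x): N next-state samples per action, then a max."""
    if i == H:
        return 0.0                           # terminal condition
    best = -math.inf
    for a in actions:
        total = 0.0
        for _ in range(N):                   # fixed budget, regardless of how good a looks
            y = sample_next(x, a, rng)
            total += nms(y, i + 1, N, H, gamma, actions, reward, sample_next, rng)
        best = max(best, reward(x, a) + gamma * total / N)
    return best

# Hypothetical deterministic check: the reward depends only on the action.
value = nms(0, 0, 2, 2, 1.0, [0, 1, 2], lambda x, a: 0.1 * a, lambda x, a, rng: x, random.Random(0))
print(value)
```

Unlike the adaptive rule, the sampling effort here is split evenly across actions, so samples continue to be spent on actions that already look clearly inferior.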

Via the Hoeffding inequality, it is straightforward to establish that for all $x \in X$ and $\epsilon > 0$,

$$\Pr\{ |V_0^*(x) - \bar{V}_0^N(x)| > \epsilon \} \le 2(N|A|)^H e^{-2\epsilon^2 N / H^2},$$

with the assumption that $R_{\max} H \le 1$. (See Lemmas 3 and 4 in Kearns, Mansour, and Ng (2001) with
appropriate modifications in the context of the discounted finite-horizon total reward criterion.)
Because application of the Hoeffding inequality to obtain the expected performance error does not
provide any useful information, we use the upper bound form of the Markov inequality (see, e.g., Hofri 1995,
p. 574): for a nonnegative bounded random variable $K$, for any $\epsilon > 0$,

$$E[K] \le \sup K \cdot \Pr\{K > \epsilon\} + \epsilon.$$

It follows that with $\epsilon > 0$,

$$E|V_0^*(x) - \bar{V}_0^N(x)| \le R_{\max} H \cdot 2(N|A|)^H e^{-2\epsilon^2 N / H^2} + \epsilon.$$

Therefore, to make the expected error go to zero, we need to select a value of $\epsilon$ arbitrarily close
to zero as $N \to \infty$. However, such a choice of $\epsilon$ will make the exponential term in the denominator
almost constant even with a very large $N$. Therefore, we expect that the convergence rate of the
nonadaptive algorithm will be much slower than the convergence rate of AMS, even with a value of
$\epsilon$ not that close to zero in practice (e.g., due to the exponential dependence on the horizon size in
the numerator), even though the main benefit of the NMS algorithm, like that of AMS, would be
independence from the state space size.

5 Concluding Remarks

To  the  best  of  our  knowledge,  this  is  the  first  work  applying  the  theory  of  the  multi-armed  bandit 
problem  to  derive  a  provably  convergent  algorithm  for  solving  general  finite-horizon  MDPs.  The 
closest related work is probably that of Agrawal, Teneketzis, and Anantharam (1989), who considered
a  controlled  Markov  chain  problem  with  finite  state  and  action  spaces,  where  transition  probabilities 
and  initial  distribution  are  parameterized  by  an  unknown  parameter  belonging  to  some  known 
finite  parameter  space  and  each  Markov  chain  induced  from  each  fixed  parameter  is  irreducible  and 
aperiodic.  Defining  a  loss  function  based  on  the  regret  of  Lai  and  Robbins  (1985),  they  provide  an 
“asymptotically  efficient”  adaptive  but  complex  control  scheme  that  works  well  for  all  parameters 
such  that  the  loss  associated  with  the  control  scheme  is  equal  to  the  lower  bound  on  the  loss  function 
asymptotically  (as  we  apply  the  scheme  over  infinite  number  of  time  steps).  The  adaptiveness  comes 
from  the  use  of  the  multi-armed  bandit  theory  for  the  stationary  control  laws.  In  other  words,  the 
arm  corresponds  to  a  particular  stationary  law  or  policy,  but  not  a  particular  action  in  the  action 
space.  We  believe  that  extending  the  AMS  algorithm  within  the  context  of  Agrawal,  Teneketzis, 
and  Anatharam  (1989)  is  not  difficult,  achieving  a  uniform  rather  than  asymptotic  result  over 
time-steps. 

We assumed without loss of generality that $|A| > 1$, because problems for which $|A| = 1$, or for which the number of admissible actions at some states is one, can be handled by transforming $M = (X, A, P, R)$ into an equivalent $M' = (X', A', P', R')$ as follows. We augment $X$ with an extra state $\bar{x}$ and $A$ with an extra action $\bar{a}$, so that $X' = X \cup \{\bar{x}\}$ and $A' = A \cup \{\bar{a}\}$. The state transition function $P'$ is defined such that $P'(x,a)(y) = P(x,a)(y)$ for all $x, y \in X$ and $a \in A$, $P'(\bar{x},a)(\bar{x}) = 1$ for all $a \in A'$, and, for all $x \in X$, $P'(x,a)(\bar{x}) = 1$ if $a = \bar{a}$ and $0$ otherwise. The reward function $R'$ is defined such that $R'(x,a) = R(x,a)$ for all $x \in X$ and $a \in A$, and $R'(\bar{x},a) = 0$ for all $a \in A'$. Note that $\bar{x}$ is just a sink state that is reachable only by the action $\bar{a}$ from the states in $X$, and $\bar{a}$ is always a suboptimal action at each state. It is left to the reader to check that the transformed MDP $M'$ gives the same optimal action and the same optimal value at each state in $X$ as $M$.
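
For a small finite MDP, the transformation can be checked directly. The sketch below is a minimal illustration under several assumptions that are not in the text: the MDP is stored in dictionaries, rewards are nonnegative, and the reward for taking the extra action in an original state is set to zero. The names `add_sink`, `optimal_values`, `x_bar`, and `a_bar` are hypothetical.

```python
def add_sink(states, actions, P, R, sink="x_bar", extra="a_bar"):
    """Augment an MDP with a sink state and an extra action leading to it.

    P[(x, a)] maps next states to probabilities; R[(x, a)] is a float.
    """
    states2 = list(states) + [sink]
    actions2 = list(actions) + [extra]
    P2, R2 = {}, {}
    for x in states:
        for a in actions:
            P2[(x, a)] = dict(P[(x, a)])  # original dynamics are unchanged
            R2[(x, a)] = R[(x, a)]
        P2[(x, extra)] = {sink: 1.0}      # the extra action jumps to the sink
        R2[(x, extra)] = 0.0              # assumption: zero reward for a_bar
    for a in actions2:
        P2[(sink, a)] = {sink: 1.0}       # the sink is absorbing
        R2[(sink, a)] = 0.0
    return states2, actions2, P2, R2


def optimal_values(states, actions, P, R, horizon, gamma=1.0):
    """Finite-horizon optimal values by backward induction."""
    v = {x: 0.0 for x in states}
    for _ in range(horizon):
        v = {x: max(R[(x, a)]
                    + gamma * sum(p * v[y] for y, p in P[(x, a)].items())
                    for a in actions)
             for x in states}
    return v


# A two-state MDP with a single action, so |A| = 1.
states = ["s0", "s1"]
actions = ["go"]
P = {("s0", "go"): {"s1": 1.0}, ("s1", "go"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "go"): 1.0, ("s1", "go"): 0.5}

v = optimal_values(states, actions, P, R, horizon=3)
states2, actions2, P2, R2 = add_sink(states, actions, P, R)
v2 = optimal_values(states2, actions2, P2, R2, horizon=3)
```

In this toy problem the optimal values on the original states are unchanged by the transformation, while the action space grows to two actions.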

For the actual implementation of the AMS algorithm, we can use the same value $N = N_i$, $i = 0, \ldots, H-1$, and we can improve the running-time complexity of AMS in heuristic ways as follows.

Given an MDP $M = (X, A, P, R)$, consider the transformation of $M$ into $M' = (X', A', P', R')$ as discussed above. Suppose that we add the following structure to the state transition function $P'$ of $M'$: $P'(x,a)(\bar{x})$ is arbitrarily close to $0$ (instead of $0$) for all $x \in X$ and $a \in A$, with a proper normalization of $P(x,a)(y)$ for $y \in X$. That is, each state $x \in X$ can reach the sink state with close to zero probability. Then solving the newly defined MDP is almost equivalent to solving the original MDP $M$. This speculation suggests that when we apply AMS at a state $x \in X$, in the Initialization step we can simply set $Q_i(x,a) = R(x,a)$ for $i = 0, \ldots, H-1$, pretending that the sampled next state is the sink state, which eliminates $O(N^H)$ computations. Furthermore, if $\gamma \ne 1$, we can heuristically set, e.g., $N_0 = n$ and $N_i = \lfloor \gamma^i n \rfloor$ for $i \ge 1$, incorporating the discounting nature.
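
The discount-based allocation of samples per stage can be sketched as follows. The function name `stage_sample_sizes` and the `min_samples` floor are assumptions added for illustration; the floor reflects the practical concern that the Initialization step presumably needs at least one sample per action, which the heuristic alone does not guarantee.

```python
import math


def stage_sample_sizes(n, gamma, horizon, min_samples=1):
    """Return [N_0, ..., N_{H-1}] with N_0 = n and N_i = floor(gamma^i * n).

    min_samples is a hypothetical safeguard, not part of the heuristic.
    """
    sizes = []
    for i in range(horizon):
        n_i = n if i == 0 else math.floor(gamma ** i * n)
        sizes.append(max(min_samples, n_i))
    return sizes


sizes = stage_sample_sizes(64, 0.5, 4)
floored = stage_sample_sizes(10, 0.5, 5, min_samples=2)
```

Later stages receive geometrically fewer samples, which matches the smaller weight that discounting gives to their contribution.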

We  can  extend  the  AMS  algorithm  to  include  the  case  where  the  reward  function  is  random. 
The  AMS  algorithm  would  essentially  remain  identical,  except  that  sampling  would  now  include 
both  the  next  state  and  the  one-stage  reward.  However,  the  convergence  proof  is  likely  to  require 
more  technical  manipulations.  Furthermore,  the  assumption  of  bounded  rewards  can  be  relaxed  by 
using the result in Agrawal (1995). Although the AMS algorithm still converges in this case, we unfortunately lose the uniform logarithmic bound, so the convergence rate is expected to be very slow.

We  can  use  the  AMS  algorithm  to  approximate  the  optimal  infinite  horizon  discounted  average 
reward  and  the  infinite  horizon  average  reward  under  an  ergodicity  assumption  by  (approximately) 
solving  a  finite-horizon  MDP.  Deriving  an  expected  error  bound  for  each  case  is  straightforward. 
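
For the discounted case, one ingredient of such an error bound is the standard truncation argument: with rewards in $[0, R_{\max}]$ and discount factor $\gamma < 1$, cutting the horizon at $H$ changes the optimal value by at most $\gamma^H R_{\max}/(1-\gamma)$. The sketch below picks the smallest horizon that meets a given tolerance. It illustrates this standard bound only and is not necessarily the derivation the authors have in mind; the function name is hypothetical.

```python
def horizon_for_tolerance(gamma, r_max, tol):
    """Smallest H with gamma^H * r_max / (1 - gamma) <= tol."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if tol <= 0.0:
        raise ValueError("tol must be positive")
    horizon = 0
    tail = r_max / (1.0 - gamma)  # value of the full infinite tail
    while tail > tol:
        tail *= gamma             # dropping one more stage scales the tail
        horizon += 1
    return horizon


h = horizon_for_tolerance(0.5, 1.0, 0.25)
```

The overall error of the resulting approximation is then at most this truncation error plus the error of the finite-horizon estimate.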

Earlier work of Cesa-Bianchi and Fischer (1998) proposed several algorithms that achieve regret bounds of the form $c_1 + c_2 \log n + c_3 \log^2 n$, where $n$ is the total number of plays and the $c_i$'s are positive constants not depending on $n$. These algorithms might also be used to create adaptive sampling algorithms for solving MDPs. However, they have the drawback that we need to know the exact value of $\alpha(x)$ for a given state $x$, under the assumption that not all of the actions are optimal, and this value is difficult to obtain in advance. The same holds for the other algorithms studied in Auer, Cesa-Bianchi, and Fischer (2002).

The alert reader might wonder what happens if we replace the weighted sum of the Q-value estimates in Equation (5) by the maximum of the estimates. We expect that the resulting algorithm will also converge to the true optimal value. However, analyzing it requires knowing how the distribution of the maximum of the estimates changes while the algorithm runs, which would be very difficult. The proof of the convergence of the resulting algorithm is an open problem.
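
The two estimators can be compared on a toy problem. The sketch below is a simplified UCB1-style adaptive sampling routine in the spirit of AMS, not a transcription of the algorithm. It assumes that the weighted sum in Equation (5) weights each Q-value estimate by the fraction of samples spent on that action, that rewards are scaled so that total rewards stay in $[0, 1]$, and that the same sample size is used at every stage. The callables `sim` and `reward` and all names are hypothetical.

```python
import math


def ams_estimate(sim, reward, actions, x, stage, horizon, n_samples,
                 gamma=1.0, estimator="weighted"):
    """Adaptive sampling estimate of the optimal value at state x.

    sim(x, a) returns a sampled next state; reward(x, a) returns the
    one-stage reward. Both are supplied by the caller.
    """
    if stage == horizon:
        return 0.0  # nothing is collected after the last stage

    def sample_q(a):
        # One sampled next state, evaluated recursively at the next stage.
        y = sim(x, a)
        return reward(x, a) + gamma * ams_estimate(
            sim, reward, actions, y, stage + 1, horizon, n_samples,
            gamma, estimator)

    # Initialization: sample every action once.
    totals = {a: sample_q(a) for a in actions}
    counts = {a: 1 for a in actions}
    n = len(actions)

    # Loop: sample the action with the largest upper confidence bound.
    while n < n_samples:
        best = max(actions, key=lambda a: totals[a] / counts[a]
                   + math.sqrt(2.0 * math.log(n) / counts[a]))
        totals[best] += sample_q(best)
        counts[best] += 1
        n += 1

    q = {a: totals[a] / counts[a] for a in actions}
    if estimator == "max":
        return max(q.values())
    # Assumed weighted sum: sampling frequencies times Q-value estimates.
    return sum(counts[a] / n * q[a] for a in actions)


# Toy problem: one state, reward 0.5 for "good" and 0 for "bad", H = 2,
# so the true optimal value is 1.0.
stay = lambda x, a: x
rew = lambda x, a: 0.5 if a == "good" else 0.0
v_weighted = ams_estimate(stay, rew, ["good", "bad"], 0, 0, 2, 16)
v_max = ams_estimate(stay, rew, ["good", "bad"], 0, 0, 2, 16,
                     estimator="max")
```

In this deterministic example the maximum-based estimate recovers the optimal value exactly, while the weighted estimate lies strictly below it because some samples are always spent on the suboptimal action. This says nothing about the stochastic case, where the behavior of the maximum is exactly what makes the analysis hard.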

Note 1 Throughout the paper, the notation $O$ is used in the sense that, for two given functions $f$ and $g$, $f(n) = O(g(n))$ if $\lim_{n \to \infty} f(n)/g(n) = c$ for some constant $c \ge 0$, and the notation $\Theta$ is used in the sense that there exist positive constants $c_1$, $c_2$, and $n_0$ such that $0 \le c_1 g(n) \le f(n) \le c_2 g(n)$ for all $n \ge n_0$ (Cormen, Leiserson, and Rivest 1990). The $O$- and $\Theta$-notations are often called the asymptotic upper bound and the asymptotically tight bound, respectively, for the asymptotic running time of an algorithm.